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Abstract 

The problem of computing an explicit isogeny between two given elliptic curves over ¥ q , 
originally motivated by point counting, has recently awaken new interest in the cryptology 
community thanks to the works of Teske and Rostovstev & Stolbunov. 

While the large characteristic case is well understood, only suboptimal algorithms 
are known in small characteristic; they are due to Couveignes, Lercier, Lercier & Joux 
and Lercier & Sirvent. In this paper we discuss the differences between them and run 
some comparative experiments. We also present the first complete implementation of 
Couveignes' second algorithm and present improvements that make it the algorithm 
having the best asymptotic complexity in the degree of the isogeny. 
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1. Introduction 



The problem of computing an explicit degree £ isogeny between two given elliptic 
curves ove r F Q wa s origina ll y moti vated by point counting methods based on Schoof's 
algorithm |Atk9l| . lElk98i. ISch95i . A review of the most efficient algorithms to solve 
this problem is given in " |BMSS08[ togethe r with a new quasi-optimal algorithm; however, 
all the algorithms presented in jBMSS08j are limited to the case I C p where p is the 
characteristic of ¥ q . This is satisfactory for cryptographic applications where one takes 
p = q or p = 2; indeed in the former case Schoof's algorithm needs £ G O(logp), while in 
the latter case there's no need to compute explicit isogenies since p-adic methods based 
on [SatOOl ] are preferred to Schoof's algorithm. 

Nevertheless, the problem of computing explicit isogenies in the case where p is small 
compared to £ remains of the or etical interest and can find practical applications in newer 
cryptosystems such as Tes06| . }RS06 |. The first algorit hm to solve this problem was given 
by Couveignes and made use of formal groups jCou94j ; it takes 0(£ 3 logg) operations in 
F p assuming p is constant, however it has an exponential complexity in log p. Another 
algorithm by Lercier specific to p = 2 uses some linear properties of the p roblem to build 
a linear system from whose solution the isogeny can be deduced Ler96l |: its complexity 
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is conjectur ed to b e 0(£ 3 \ogq) operations in F p , but it has a much better constant 
factor than Cou94l | . At the moment we write, the latter algorithm is by many orders of 



magnitude the fastest algorithm to solve practical instances of the problem when p = 2, 
thus being the de facto standard for cryptographic use. 

p-adic methods were used by Joux and Lercier jjL06| and Lercicr and Sirvent |LS09l ] 
to solve the isogeny problem. The former method has complexity 0(£ 2 (1 + £/p)\ogq) 
operations in F p , which makes it well adapted to the case where p ~ logg. The latter 
has complexity 0(£ 3 + £\ogq 2 ) operations in F p , making it the best algorithm to our 
knowledge for the case where p is not constant. 

The algorithm C2 and its variants. Finally, the alg orithm having the best asymptotic 
complexity in £ was proposed again by Couveignes in Cou96l | ; we will refer to t his origi nal 



version as Its complexity -supposing p is fixed- was estimated in jCou96l | as 

being 0{£ 2 logg) operations in F p , but with a precomputation step requiring 0(£ 3 logq) 
operations and large memory requirements. However, some more work is needed to 
effectively reach these bounds, while a straightforward implementation of C2 has an 
overall asymptotic complexity of Q(£ 3 logq) o perations, as we will argue in Section [31 

Subsequent work by Couveignes |CouOCll |. and more recently |DFS1C| . use Artin- 
Schrcicr theory to avoid the precomputation step of C2 and drop the memory require- 
ments to 0(£\ogq + log 2 q) elements of F p . However, this is still not enough to reduce 
the overall complexity of the algorithm, as we will argue in Section @] We refer to this 
variant as "C2-AS" . 

In the present paper we give a complete review of Couveignes' algorithm, we present 
new variants that reach the foreseen quadratic bound in I 2 and prove an accurate com- 
plexity estimate which doesn't suppose p to be fixed. We also run experiments to compare 
the performances of C2 with other algorithms. 

Notation and plan. In the rest of the paper p is a prime, d a positive integer, q = p d 
and W q is the field with q elements. For an elliptic curve E and a field K embedded in 
an algebraic closure K, we note by E{K) the set of K-rational points and by E[m) the 
m-torsion subgroup of E(K). The group law on the elliptic curve is noted additively, its 
zero is the point at infinity, noted O. For an affine point P we note by x(P) its abscissa 
and by y(P) its ordinate. We will restrict ourselves to the case of ordinary elliptic curves, 
thus E[p k ] ^ Z/p k Z. 

Unless otherwise stated, all time complexities will be measured in number of opera- 
tions in F p and all space complexities in number of elements of F p ; we do not assume p to 
be constant. We use the O, 9 and f2 notations to state respectively upper bounds, tight 
bounds and lower bounds for asymptotic complexities. We also use the notation O x that 
forgets polylogarithmic factors in the variable x, thus 0(xy log x logy) C O x (xy\ogy) C 
O x ,y{xy). We simply note O when the variables are clear from the context. 

We let 2 < oj ^ 3 be the exponent of linear algebra, that is an integer such that 
n x n matrices can be multiplied in n w operations. We let M : N — > N be a multiplica- 
tion function, such that polynomials of degree at most n with coefficients in F p can be 



multiplied in M(n) operations, under the conditions of [vzGGl . Ch. 8.3]. Typical orders 



*As opposed to the algorithm presented in [Cou94H . an algorithm "C2" shares many similarities with. 
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of magnitude are O(n log23 ) for Karatsuba multiplication or 0(n log n log log n) for FFT 
multiplication. Similarly, we let C : N — > N be the complexity of modular composition, 
that is a function such that C(n) is the number of field operations needed to compute 
fog mod h for /, g, h S ¥L[X] of degree at most n with coefficients in an arbitrary field 

The best known algorithm is BK78 |. this implies C(n) £ O (ji^^i 1 ^. Note that in a 



K 



boolean RAM model, the algorithm of |KU08l | takes quasi-linear time 



Organisation of the paper. In Scction[5]we give preliminaries on elliptic curves and isoge- 
nics. In Sections [3] through |5] we develop the algorithm C2 and we incrementally improve 
it by giving a new faster variant in each Section. Section IT] give s technical details on our 
implementations of the algorithms of this paper and of [LS09j . Finally in Section [8] we 
comment the results of the experiments we ran on our implementations. 



2. Preliminaries on Isogenies 

Let E be an ordinary elliptic curve over the field F^. We suppose it is given to us as 
the locus of zeroes of an affine Weicrstrass equation 

y 2 + a\xy + a^y = x 3 + a-ix 2 + a±x + a\, . . . ,ag e ¥ q . 

Simplified forms. If p > 3 it is well known that the curve E is isomorphic to a curve in 
the form 

y 2 = x 3 + ax + b (1) 

and its j-invariant is j(E) = 16 1 ( ™ ( ^ b2) ■ 

When p = 3, since E is ordinary, it is isomorphic to a curve 

y 2 = x 3 + ax 2 + b (2) 
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and its j-invariant is j(E) = — 

Finally, when p = 2, since E is ordinary, it is isomorphic to a curve 

y 2 + xy = x 3 + ax 2 + b (3) 

and its j-invariant is j (E) = -r . 

These isomorphism are easy to compute and we will always assume that the elliptic 
curves given to our algorithms are in such simplified forms. 

Isogenies. Elliptic curves are endowed with the classic group structure through the chord- 
tangent law. A group morphism having finite kernel is called an isogeny. Isogenies are 
regular maps, as such they can be represented by rational functions. An isogeny is said 
to be K-rational if it is K-rational as regular map; its degree is the degree of the regular 
map. 

One important property about isogenies is that they factor the multiplication-by-m 
map. 

Definition 1 (Dual isogeny). Let X : E — > E' be a degree m isogeny. There exists an 
unique isogeny X : E 1 E, called the dual isogeny such that 

XoX=[m]E and XoX=[m]E> 
3 
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Figure 1: Factorization of an isogeny. X' has kernel E[m] ffikerX. 



As regular maps, isogenies can be separable, inseparable or purely inseparable. In 
the case of finite fields, purely inseparable isogenies are easily understood as powers of 
the frobenius map. Let 

E {p) : y 2 + a\xy + a%y = x 3 + a p 2 x 2 + a%x + a% 

then the map 

cj) : E -> E {p) 
(x,y) ^ (x p ,y p ) 

is a degree p purely inseparable isogeny. Any purely inseparable isogeny is a composition 
of such frobenius isogenics. 

Let E and E' be two elliptic curves defined over F g , by finding an explicit isogeny we 
mean to find an (F g -rational) rational function from E(¥ q ) to E'(¥ q ) such that the map 
it defines is an isogeny. 

Since an isogeny can be uniquely factored in the product of a separable and a purely 
inseparable isogeny, we focus ourselves on the problem of computing explicit separable 
isogenies. Furthermore one can factor out multiplication-by-m maps, thus reducing the 
problem to compute explicit separable isogenies with cyclic kernel (see figure 

In the rest of this paper, unless otherwise stated, by ^-isogeny we mean a separable 
isogeny with kernel isomorphic to 7Lj 17L. 



Velu formulae. For any finite subgroup G C E(K), Velu formulae Vel71j give in a 
canonical way an elliptic curve E and an explicit isogeny X : E — > E such that kerl = G. 
The isogeny is K-rational if and only if the polynomial vanishing on the abscissae of G 
belongs to K[X}. 

In practice, if E is defined over V q and if 

h(X)= U(X-x(P))e¥ q [X] 

PEG 

is known, Velu formulae compute a rational function 

^,y) = ( g M k -W) (4) 



h{x)' l(x) 

and a curve E such that I : E — > E is an F 9 -rational isogeny of kernel G. A consequence 
of Velu formulae is 

deg 5 = deg/ l + I = #G. (5) 
4 
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Figure 2: Using Velu formulae to compute an explicit isogeny. 



Given two curves E and E' , Velu formulae reduce the problem of finding an explicit 
isogeny between E and E' to that of finding the kernel of an isogeny between them. 
Once the polynomial h(X) vanishing on kerX is found, the explicit isogeny is computed 
composing Velu formulae with the isomorphism between E and E' as in figured! 

3. The algorithm C2 



The algorithm we refer to as C2 was originally proposed in |Cou 96] . It takes as 



input two elliptic curves E, E' and an integer £ prime to p and it returns, if it exists, an 
F g -rational isogeny of degree £ between E and E' . It only works in odd characteristic. 

3.1. The original algorithm 

Suppose there exists an F g -rational isogeny I : E — >• E' of degree i. Since £ is prime 
to p one has I(E[p k ]) = E'[p k ] for any k. 

Recall that E[p k ] and E'[p k ] are cyclic groups. C2 iteratively computes generators 
Pk,P' k of E[p k ] and E'[p k ] respectively. Now C2 makes the guess l(Pk) = P^, then, if T 
is given by rational fractions as in @, 

ffinpfl =xi ^ Pk) fOT ^z//z (6) 

h(x([i\P k )) 

and by (|S|) deg g = deg h + 1 = I. 

Using ([6]) one can compute the rational fraction through Cauchy interpolation 

over the points of E[p k ] for k large enough. C2 takes p k > A£ — 2, interpolates the rational 
fraction and then checks that it corresponds to the restriction of an isogeny to the x- 
axis. If this is the case, the whole isogeny is computed through Velu formulae and the 
algorithm terminates. Otherwise the guess I(Pfe) = P' h was wrong, then C2 computes a 
new generator for E'[p k ] and starts over again. 

We now go through the details of the algorithm. 

The p-t orsion. The computation of the p-torsion points follows from the work of Gunji 



Gun76l | . Here we suppose p ^ 2. 

Definition 2. Let E have equation y 2 — f(x). The Hasse invariant of E, noted He, is 
the coefficient of V p_1 in f(X) R 2~. 

Gunji shows the following proposition and gives formulae to compute the p-torsion 
points. 
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Proposition 3. Let c = p \/He; then, the p-torsion points of E are defined in F g [c] 
and their abscissae are defined in ¥ q [c 2 ] . 

The p k -torsion. p fc -torsion points are iteratively computed via p-descent. The basic idea 
is to split the multiplication map as [p] = <j) o V and invert each of the components. 
The purely inseparable isogeny </> is just a frobenius map and the separable isogeny V 
can be computed by Velu formulae once the p-torsion points are known. Although this 
is reasonably efficient, pulling V back may involve factoring polynomials of degree p in 
some extension field. 



A finer way to do the p-descent, as suggested in the original paper [Cou96|, is to use 
the work of Voloch Vol9Clj . Suppose p 7^ 2, let E and E have equations respectively 



set 



y 2 = f{x) = x 3 + a 2 x 2 + a 4 x + a 6 , 

y 2 = f{x) = x 3 + tfa^x 2 + tfaix + tfa^ 



f{Xf£ = a(X) + H^X"- 1 + X'Pp(X) (7) 



with dcga < p — 1 and -ffg the Hasse invariant of E. Voloch shows the following 
proposition. 

Proposition 4. Let c = p ~^/i?g, the cover of E defined by 

C: zP-z=M& ( 8 ) 

is an etale cover of degree p and is isomorphic to E over ¥ q [c] ; the isomorphism is given 
by 

' (x,y) = V(x,y) 



p-i 



1 (9) 



where P\ is a primitive p-torsion point of E . 

The descent is then performed as follows: starting from a point P on E, first pull it 
back along <p, then take one of its pre- images in C by solving equation ([8]). finally use 
equation (|9|) to land on a point P' in E. The proposition guarantees that [p]P' = P. 
The descent is pictured in figure [3] 

The reason why this is more efficient than a standard descent is the shape of equation 
([5]): it is an Artin-Schreicr equation and it c an be s olved by many techniques, the simplest 



being linear algebra (as was suggested in Cou96|). Once a solution z to ((5J is known, 



solving in x and y the bivariatc polynomi al syste m (O takes just a GCD computation 
(explicit formulae were given by Lercier in (Ler97 . §6.2], we give some slightly improved 



ones in Section [7|)- Compare this with a generic factoring algorithm needed by standard 
descent. 

Solving Artin-Schreier equations is the most delicate task of the descent and we will 
further discuss it. 
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Figure 3: Two ways of doing the p-descent: standard on the left and via a degree p cover on the right 



Cauchy interpolation. Interpolation reconstructs a polynomial from the values it takes 
on some points; Cauchy interpolation reconstructs a rational fraction. The Cauchy in- 
terpolation algorithm is divided in two phases: first find the polynomial P interpolating 
the evaluation points, then use rational fraction reconstruction to find a rational fraction 
congruent to P modulo the polynomial vanishing on the points. The first phase is car- 
ried out through any clas sical int erpolation algorithm, while the second is similar to an 



XGCD computation. See [vzGQ . §5.8] for details. 

Cauchy interpolation needs n + 2 points to reconstruct a degree (k, n — k) rational 
fraction. This, together with ([S]), justifies the choice of k such that p k > Al— 2. Some of 
our variants of C2 will interpolate only on the primitive p fe -torsion points, thus requiring 
the slightly larger bound 4>(p k ) M—2. This is not very important to our asymptotical 
analysis since in both cases p k £ 0(£). 

Recognising the isogeny. Once the rational fraction |»S has been computed, one has to 
verify that it is indeed an isogeny. The first test is to check that the degrees of g and 
h match equation ([5|), if they don't, the equation can be discarded right away and the 
algorithm can go on with the next trial. Next, one can check that h is indeed the square 
of a polynomial (or, if £ is even, the product of one factor of the 2-division polynomial 
and a square polynomial). This two tests are usually enough to detect an isogeny, but, 
should they lie, one can still check that the resulting rational function is indeed a group 
morphism by trying some random points on E. 

3.2. The case p = 2 

The algorithm as we have presented it only works when p ^ 2, it is however an easy 
matter to generalise it. The only phase that doesn't work is the computation of the 
p fe -torsion points. For curves in the form ([3]) the only 2-torsion point is (0, y/b). 

Voloch formulae are hard to adapt, nevertheless a 2-descent on the Kummer surface 
of E can easily be performed since the doubling formula reads 

*^-^ + ^-*(^P)-*° r - (10) 

Given point ip on Ke, a pull-back along <fi gives a point xp on K^. Then pulling V 
back amounts to solve 

x 2 + x P x = Vb (11) 

and this can be turned in an Artin-Schreier equation through the change of variables 
x —t x'xp. 

7 



From the descent on the Kummer surfaces one could deduce a full 2-descent on the 
curves by solving a quadratic equation at each step in order recover the y coordinate, 
but this would be too expensive. Fortunately, the y coordinates are not needed by the 
subsequent steps of the algorithm, thus one may simply ignore them. Observe in fact 
that even if Ke does not have a group law, the restriction of scalar m ultiplication is well 



defined and can be computed through Montgomery formulae [Mon87l | . This is enough to 
compute all the abscissae of the points in £'[p fe ] once a generator is known. 

3.3. Complexity analysis 

Analysing the complexity of C2 is a delicate matter since the algorithm relics on some 
black-box computer algebra algorithms in order to deal with finite extensions of ¥ q . The 
choice of the actual algorithms may strongly influence the overall complexity of C2. In 
this section we will only give some lower bounds on the complexity of C2, since a much 
more accurate complexity analysis will be carried out in Sectional 

p-torsion. Applying Gunji formulae first requires to find c and c', p— 1-th roots of He and 
He> , and build the field extension ¥ q [c] = ¥ q [c'] . Independently of the actual algorithm 
used, observe that in the worst case ¥ q [c] is a degree p — 1 extension of ¥ q , thus simply 
representing one of its elements requires <d(pd) elements of ¥ p . 

Subsequently, the main cost in Gunji's for mulae is t he computation of the determinant 
of a 2^ x quadri-diagonal matrix (see |Gun76l ] ) . This takes 6(p 2 ) operations in 
¥ q [c] by Gauss elimination, that is no less than £l(p 3 d) operations in F p . 

p k -torsion. During the p-descent, factoring of equations ([5]) or pip may introduce some 
field extensions over F g [c]. Observe that an Artin-Schreier polynomial is cither irreducible 
or totally split, so at each step of the p-descent we either stay in the same field or we take 
a degree p extension. This shows that in the worst case, we have to take an ex tension 



of degree p k 1 over F q [c] . The following proposition, which is a generalisation of Ler97 
Prop. 26], states precisely how likely this case is. 

Proposition 5. Let E be an elliptic curve over¥ q , we noteUi the smallest field extension 
of¥ q such that E[p l ] C E(Vi). For any i ^ 1, either [U i+ i : ILJ = p or U i+1 = Uj = 
■■■ =Ul 

Proof. Observe that the action of the Frobenius <fi on E[p] is just multiplication by the 
trace t, in fact the equation 

(j) 2 — [t mod p] o + [q mod p] = 

has two solutions, namely [t mod p) and [0 mod p), but the second can be discarded since 
it would imply that (f> has non-trivial kernel. By lifting this solution, one sees that the 
action of (f> on the Tate module T P {E) is equal to multiplication by some r € Z p . 

Note G the absolute Galois group of ¥ q , there is a well known action of G on T P (E). 
Since G is generated by the Frobenius automorphism of ¥ q , the restriction of this action 
to E[p k ] is equal to the action (via multiplication) of the subgroup of (z/p k zy generated 
by ifc — t mod p k . Hence [Ufc : ¥ a ] = ord(rfe). 

Then, for any k > 1, Ler97l Corollary 4] applied to r^+i = r mod p k+l shows that 



ord(rfc + i) = ord(rfe) implies ord(rfc) = ord(rjt_i) and this concludes the proof. □ 



Thus for any elliptic curve there is an io such that [Uj : Ui] = p' l ~ l ° for any i ^ ig. 
This shows that the worst and the average case coincide since for any fixed curve [Ufe : 
Ui] G @(p k ) asymptotically. In this situation, one needs Q(p k d) elements of F p to store 
an element of Ufe . 

Now the last iteration of the p-descent needs to solve an Artin-Schreier equation in 
Ufe. To do this C2 precomputes the matrix of the F g -linear application (x q — x) : Ufe — >• Ufe 
and its inverse, plus the matrix of the F p -linear application {x p — x) : ¥ q — > ¥ q and its 
inverse. The former is the most expensive one and takes Q(p uk ) operations in ¥ q , that 
is Q,(p" k d) = Q,(l u d) operations in ¥ p , plus a storage of <d(£ 2 d) elements of ¥ p . Observe 
that this precomputation may be used to compute any other isogeny with domain E. 

After the precomputation ha s been done, C2 successively applies the two inverse 
matrices; details can be found in jCou96l §2.4]. This costs at least Q(£ 2 d). 

Interpolation. The most expensive part of Cauchy interpolation is the polynomial inter- 
polation phase. In fact, simply representing a polynomial of degree p k — 1 in Ufc[X] takes 
<d(p 2k d) elements, thus at least fl(£ 2 d) operations are needed to interpolate unless special 
care is taken. This contri bution d ue to arithmetics in Ufe had been underestimated in 
the complexity analysis of Cou96j . which gave an estimate of Q(£dlog£) operations for 



this phase. We will give more details on interpolation in Section [5] 

Recognising the isogeny. The cost of testing for squareness of the denominator and other 
tests is negligible compared to the rest of the algorithm. Nevertheless it is important to 
realize that on average half of the 4>(p k ) mappings from E[p k ] to E'[p k ] must be tried 
before finding the isogeny, for only one of these mappings corresponds to it. This implies 
that the Cauchy interpolation step must be repeated an average of Q(p k ) times, thus 
contributing a f2(£ 3 d) to the total complexity. 

Summing up all the contributions one ends up with the following lower bound 

n(£ 3 d + p 3 d) (12) 

plus a precomputation step whose cost is negligible compared to this one and a space 
requirement of Q(£ 2 d) elements. In the next sections we will see how to make all these 
costs drop. 



4. The algorithm C2-AS 

One of the most expensiv e steps of C2 is the resolution of an Artin-Schreier equation 
in an extension field TLX;. In Cou00| | Couveignes gives an approach alternative to linear 



algebra to solve this problem. First it builds the whole tower (Ui = ¥ q [c] , . . . , Ufe) of 
intermediate extensions, then it solves an Artin-Schreier equation in Uj recur sively b y 
reducing it to another Artin-Schreier equation in Uj. Details are in [CouOflj and [DFSlOl ]. 



To solve the final Artin-Schreier equation in Ui = ¥ q [c] one resorts to linear algebra, 
thus prccomputing the inverse matrix of the F p -lincar application (x p — x) : Ui — > Ui. 

4-1. Complexity analysis 

How effective this method is depends on the way algebra is performed in the tower 
(Ui, . . . , Ufe). The present author and Schost DFSlOl ] recently presented a new construc- 
tion based on Artin-Schreier theory that allows to do most arithmetic operations in the 
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tower in quasi-linear time. Assuming this construction is used, we can now give precise 
bounds for each step of C2-AS. 

p-t orsion. The construction of ¥ q [c] may be done in many ways. The only requirements 
of |DFS10j are 



1. that its elements have a representation as elements of F p [X]/Qi(X) for some irre- 
ducible polynomial Qi, 

2. that either (d,p) = 1 or degQ^ + 2 = degQi. 

Selecting a random polynomial Q\ and testing for irrcducibility is usually enoug h to 



meet these conditions. This costs 0(pdM(pd) \og(pd) \og(p 2 d)) according to vzGGl . Th. 
14.42]. 

Now we need to compute the embedding ¥ q C F g [c]. Supposing F g is represe nted as 



W p [X]/Qq(X), we factor Q in F g [c], which costs 0(pdM(pd 2 ) log dlogp) using [vzGG 



Coro. 14.16]. Then the most naive technique to express the embedding is linear algebra. 
This requires the computation of pd elements of ¥ q [c] at the expense of Q(pdM(pd)) 
operations in F p , then the inversion of the matrix holding such elements, at a cost of 
0((pd)") operations. This is certainly not optimal, yet this phase will have negligible 
cost compared to the rest of the algorithm. 

Now we can compute c and d by factoring the polynomials Y p ~ l — He and Y p ~ 1 — He' 
in F p [X]/Qi(X). This costs 

O ( (pC {pd) + C(p)M (pd) + M(p)M (pd) log p) (log 2 p + log d) ) 



using 



KS97I . Section 3]. 

Finally, computing the determinants needed by Gunji's formulae takes Q(p 2 ) multi- 
plications in F g [c], that is Q(p 2 M(pd)) . 

Letting out logarithmic factors, the overall cost of this phase is 

6(p 2 d 3 +pC(pd) + C(p)pd+(pd)") (13) 

p k -torsion. Application of Voloch formulae requires at each of the levels U2, ■ ■ ■ , Ufe 

1. to solve equation ([8]) by factoring an Artin-Schreier polynomial, 

2. to solve the system 

If we assume the worst case [Uq : Ui] = p. according to [PFSldl . Th. 13], at each level i 
the first step costs 

0((pd)"i + PT(i - 1) + M( P l+1 d) \og P ) 

where PT(«) = O ((pi + log(d))il(i) + p l C(pd) log 2 (pd)) 
and L(i) = 0(p i+2 d\og 2 pP i+1 d + pU(p l+1 d)) ; 
while the second takes the GCD of two degree p polynomials in U,[X] for each i (sec 



Section [7]), at a cost of 0(M(p l+1 d) logp) operations using a fast algorithm [vzGGl . §11.1]. 



Summing up over i, the total cost of this phase up to logarithmic factors is 

d p , d , loge ^(pdriog 2 p £ + p 2 mog p £+^C(pd)^j . (14) 

Also notice that there is no more need to store a p k ~ 1 d x p k ~ 1 d matrix to solve the 
Artin-Schreier equation, thus the space requirements are not anymore quadratic in I. 
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Interpolation. The interpolation phase is not essentially changed: one needs f irst to 
interpolate a degree p k — 1 polynomial with coefficients in then use |dfsk1 p ush- 
down] to obtain the corresponding polynomial in F g [A] and finally do a rational fraction 
reconstruction. 

The first step costs 0(M(p 2k d) logp fc ) using fast techniques as |vzGG . §10.2], then 



converting to F,[c][X] takes 0(p k l(k - 1)) by [PFSlfjl ] and further converting to ¥ q [X] 
takes 0((pd) 2 ) by linear alg ebra. T he rational function reconstruction then takes 0(M(p k d) logp k ) 
using fast GCD techniques [yz cel . §ii.i]. 

The overall complexity of one interpolation is then 

o (m (e 2 d) io gp i + ei(k - 1) + (pd) 2 ) . (15) 

Remember that this step has to be repeated an average number of (f>(p k )/4 times, thus 
the dependency of C2-AS in I is still cubic. 



5. The algortihm C2-AS-FI 

The most expensive step of C2-AS is the polynomial interpolation step which is part 
of the Cauchy interpolation. If we use a standard interpolation algorithm, its input 
consists in a list of Q(p k ) pairs (P,I(P)) with P S Vk, thus a lower bound for any such 
algorithm is fl(p 2k d). Notice however that the output is a polynomial of degree <d(p k ) 
in Fq[X], hence, if supplied with a shorter input, an ad hoc algorithm could reach the 
bound Vt(p k d). 

In this section we give an algorithm that reaches this bound up to some logarithmic 
factors. It realizes the polynomial interpolation on the primitive points of E[p k \, thus 
its output is a degree 4>{p k )/2 — 1 polynomial in F q [X]. Using the Chinese remainder 
theorem, it is straightforward to generalise this to an algorithm having the same asymp- 
totic complexity realizing the polynomial interpolation on all the points of E[p k ]. Wc 
call C2-AS-FI the variant of C2-AS resulting from applying this new algorithm. 



5.1. The algorithm 

We set some notation. Let io be the largest index such that Ui = Ui and let 

2c ' 



£ — ^ — [F„[c 2 ] : F„]. For notational convenience, wc set Uo = F 



We note T(X) the polynomial vanishing on the primitive points of E[p k ] and 

T = \{Tf (16) 

its factorisation over Uj; we remark that all the T^'s have degree 2 ° — -. We also 
note A(X) the goal polynomial and 

Af = A mod r, W . (17) 



It was already pointed out in |Cou96 . §2.3] that if all the Aj's are known one can 



recover A using the Chinese remainder theorem. If we chose any point P such that 
t!> 0) (x{P)) = and fix the embedding 

F,W / T <o) (x) <-^U, (18) 
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given by b(X) = x(P), then it is evident that i(Aj (X)) = x(l(P)), thus in order to 

compute Aj 3 ^ one just needs to compute C 1 (x(I(P))) . 

Unfortunately, the information needed to compute t was lost in the p-descent, for we 
don't even know the T^ h s. None of the algorithms of |DFSld | helps us to compute such 
information and straightforward computation of it would be too expensive. The solution 
is to decompose i as a chain of morphisms and invert them o ne-by-one going down in 
the tower (Uo,Ui, . . .,Ufc), this is similar to the way [CouOOt solves an Artin-Schreier 
equation by moving it down from Ufe to Ui. 

The moduli. We first need to compute T^ G iJi[X] for any i. For this we fix a primitive 

point P G -E[p fc ] and we reorder the indices in (|16[) so that is the minimal polynomial 
of x(P) over Uj. 

The first minimal polynomial is simply 

4 k} (X) = X-x(P). (19) 

Now suppose we know Tq +1 \ then a generator a of Gal(Uj+i/Ui) acts on the roots of 
Tq 1+1 ^ sending them on the roots of some Tj +1 \ Then the minimal polynomial of x{P) 
over Ui is 

Tf = [] CT ( T o i+1) ) ■ (20) 

creGal(U I + i/Ui) 

Some care has to be taken when computing Tq : in fact the abscissae of the points may 
be counted twice if c g" F 9 [c 2 ]. In this case only a subgroup of index 2 of Gal(Ui/Uo) 
must be used instead of the whole group. 

The interpolation. The computation of Aq is done in the same recursive way. Fix the 

(i) 

same point P used to compute the T 's and fix the chain of embeddings 

'tW(Xo) *~ <T ( \ k) (X k ) l^-LJ 

given by tfc o • • • o Li(Xi) = x(P) for any i. 

We compute A^ by inverting the chain: inverting simply gives 

4 fe) =x(J(P)). (22) 

Then suppose we know Aq 1 , and decompose the embedding tj as 

Ui[Xil / ToW( x i )^ Ui+l[Xi+I V Tri ) ( ^ +l) (23) 



where e is the canonical injection extending U< C Uj+i, 7 is the Chinese remainder 
isomorphism and 7r is projection onto the first coordinate. 
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To invert it observe that any a € Gal(U;+i/Ui) leaves invariant while it permutes 

(i+l) 



the moduli Tf^' , thus 



A 



mod 



(24) 



Hence we can obtain all the Aj through the action of GalfUj +i/Uj) on Aq 



(i+l) 



Then we can invert 7 through a Chinese remainder algorithm [vzGGl . §10.3] and e by 
converting coefficients from U;+i to U$. 

As for the moduli, a special treatment is needed for tp if c ^ F g [c 2 ]. 



5.2. Complexity analysis 

The two algorithms for computing the 's and the A^ 's are very similar and run 
in parallel. We can merge them in one unique algorithm, at each level i ^ io it does the 
following 

1. for a e Gal(U,+i/Uj), call a the permutation it induces on the indices of the 



if +1) 's, compute 



j 



(a) T<$> 

(b) 4^ 



-.(i+l) 



o~ [ A 



and 



using 



DFSld . IterFrobenius] 



compute Tq 1 ^ through a subproduct tree as in [vzGGl . Algo. 10.3], 
3. compute A^ through Chinese Reminder Algorithm vzGG . Algo. 10.16], 



4. convert T Q W and A$ into elements of Vi[X] using |PFS10l . Push-down]. 

Steps [Ta| and llbl are identical. Both a re repeated p times, each iteration taking 
0(p k - l L(i - t )) C 0(L(fc - i )) by [DFSlOl, Th. 17] . 

SteplDtake s O(M( p fc ~' 0+1 d/r) logp) by |vzGGl . Lemma 10.4] and stcp|3]has the same 
complexity by (vzGGl . Coro. 10.17]. 

StepHtakes 0(p k - i+1 L(i - i )) C 0(pl(k - i )). 

When i = and Ui 7^ ¥ q the algorithm is identical b ut steps [Tal and [Tb] must 
be computed through a generic frobenius algorithm (using jvzGS92 . Algorithm 5.2], 
for example) and step 0] must use the implementation of F q [c] to make the conversion 
(for example, linear algebra). In this case steps [Tal and [TBI cost 6( p ° C(pd) log d) 
by [vzGS92l Lemma 5.3] and step H costs Q(p k - l ° (pd) 2 ) . 

The total cost of the algorithm is then 

O (k -i )(pL(k-i ) + M(p k - to+1 d/r) logp) + il_(C (pd) log d + r(pd) 2 ) 



After all, the whole algorithm looks a lot like fast interpolation [vzGGl. §10] a nd it is 
indeed a modified version of it. A similar algorithm was already given in |EM03| . 
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The complete interpolation. We compute all the A"- s using this algorithm; there's 
pio-i r f them. We then recombine them through a Chinese remainder algorithm at 
a cost of 0(M(p k d) logp l0_1 r). The total cost of the whole interpolation phase is then 

O ((k - i ) (pl(k) + M(p k d) \ogp) + p k - l C(pd) \ogd + p k - 1 r{pdf + i M(p k d) \ogp) , 
that is 

0\pL(k)log(J^j +M(£d)log£logp+^C(pd)\ogd + e(pd) 2 ^j . (25) 
Alternatively, once is known, one could compute the other 's using m odular 



composition with the multiplication maps of E and E' as suggested in [Cou96( . How 



ever this approach doesn't give a better asymptotic complexity because in the worst 
case A^ = A. From a p ractica l point of view, though, Brent's and Kung's algorithm 
for modular composition [BK78j . despite having a worse asymptotic complexity, could 
perform faster for some set of parameters. We will discuss this matter in Section [51 

If more than 0(p fc )/2 points are needed, but less than one can use the previous 
algorithm to compute all the polynomials Ai interpolating respectively over the p I -torsion 
points of E and E' . They can then be recombincd through a Chinese remainder algorithm 
at a cost of 0(M(p k d) logp k ), which doesn't change the overall complexity of C2-AS-FI. 

Putting together the complexity estimates of C2-AS and C2-AS-FI, we have the 
following. 

Theorem 1. Assuming M(n) = nlognloglogn, the algorithm C2-AS-FI has worst case 
complexity 

O p , d , loe e (p 2 d 3 + C(p)pd+ (pdr log 2 e + p 3 e 2 d\og 3 £ + p 2 fd 2 + +p^j C(pd^j . 



6. The algorithm C2-AS-FI-MC 

However asymptotically fast, the polynomial interpolation step is quite expensive for 
reasonably sized data. Instead of repeating it times, one can use composition with 

the Frobenius endomorphism 4>e in order to reduce the number of interpolations in the 
final loop. 

6.1. The algorithm 

Suppose we have computed, by the algorithm of the previous Section, the polynomial 
T vanishing on the abscissae of E[p k ) and an interpolating polynomial Aq £ V q [X] such 
that 

A (x([n]P)) = x([n]P') for any n. 

The group Gal(Ufe/F g ) = lip) acts on E'[p k ] permuting its points and preserving the 
group structure. Thus, the polynomial 

A\ = Aq o if = ip o Ao 

is an interpolating polynomial such that 



^i(x([n]P)) = x([n}<f> E >(P')) for any n, 
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where 4>e> is the Frobenius endomorphism of E 1 . Since 4>e' {P') is a generator of E'[p k ], A\ 
is one of the polynomials that the algorithm C2 tries to identify to an isogeny. By iterating 
this construction we obtain [Ufc : F 9 ]/2 different polynomials Ai for the algorithm C2 
with only interpolation. 

To compute the A^s, we first compute F £ ¥ q [X] 

F{X) = X q modT(X), (26) 

then for any 1 < i < [U fc : ¥ q }/2 

Ai(X) = Ai-^X) o F(X) mod T(X). (27) 

If = p l0_1 T, we must compute p l °~ l r polynomial interpolations and apply this 

algorithm to each of them in order to deduce all the polynomials needed by C2. 



6.2. Complexity analysis 

We compute (f!2"6")) via square-and-multiply, this costs Q(dM(p k d) \ogp) operations. 
Each application of (|27[) is done via a modular composition, the cost is thus Q (C(jP fc )) 
operations in F g , that is 0(C(p k )M(d)) operations in ¥ p . Using the algorithm of KU08 | 
for modular composition, the complexity of C2-AS-FI-MC wouldn't be essentially dif- 
ferent from the one of C2-AS-FI; however, in practice th e faste st algorithm for modular 
composition is BK78 |. and in particular the variant in [KS981 Lemma 3], which has a 
worse asymptotic complexity, but performs better on the instances we treat in Section [5] 

Notice that a similar approach could be used inside the polynomial interpolation step 
(see Section [5} to deduce A^ from A^ using m odular composition with the multiplica- 
tion maps of E and E' as described in Cou96l §2.3]. This variant, though, has an even 
worse complexity because of the cost of computing multiplication maps. 



7. Implementation 

We implemented C2-A S-FI-M C as C++ programs using the libraries NTL NTli for fi- 
nite field arithmetics, gf 2x (gf2xj for fast arithmetics in characteristic 2 and FAAST [DFSld ] 
for fast arithmetics in Artin-Schreier towers. 

This section mainly deals with some tricks we implemented in order to speed up the 
computati on. At th e end of the section we br iefly discuss the implementation we made 
in Magma Magma of the algorithm in |LS09| . 



7.1. Building E[p k ] and E'[p k ] 

p-torsion. For p ^ 2, C2 and its variants require to build the extension F 9 [c] where c is 
a p — 1-th root of He- In order to deal with the lowest possible extension degree, it is a 
good idea to modify the curve so that \F q [c] : ¥ q ] is the smallest possible. 

[Fq[c] : Wq] is invariant under isomorphism, but taking a twist can save us a quadratic 
extension. Let u = c~ 2 , the curve 

E : y 2 = x 3 + a2ux 2 + a^u 2 x + a^u 3 

is defined over F 9 [c 2 ] and is isomorphic to E over F g [c] via (x, y) i— > (y/u 2 x , y/u 3 y) . Its 
Hasse invariant is H E = (u) E 2~ H E = 1, thus its p-torsion points are defined over F g [c 2 ]. 
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In order to compute the p^-torsion points of E we build F 9 [c 2 ], we compute P a 
p fe -torsion points of E using p-descent, then we invert the isomorphism to compute the 
abscissa of P € E[p k ]. Since the Cauchy interpolation only needs the abscissae of E[p k ], 
this is enough to complete the algorithm. Scalar multiple s of P can be computed without 
knowledge of y(P) using Montgomery formulae [Mon87| . 

Remark that for p — 2 we use the same construction in an implicit way since we do 
a p-descent on the Kummer surface. 

p k -torsion points. For p 7^ 2 we use Voloch's p-descent to compute the p fe -torsion points 
iteratively as described in Secti on EH To fac tor the Artin-Schreier polynomial ([5]), we 
use the algorithms from CouOC| and DFSlOl ] that were analysed in Section [U All these 
algorithms were provided by the library FAAST. 
To solve system ([9]) we first compute 



V(x,y) 



9{x) ( g{x) \ 
,sy 



h 2 {: 



h 2 {x) 



through Velu formulaeil Recall that we work on a curve having Hasse invariant 1, system 
© can then be rewritten 



h 2 (a 



X = 
Y = sy 
Z = -2y 



9{x) \ 
h 2 {x) 
h'(x) 
h(x) 



where (X, Y, Z) is the point on the cover C that we want to pull back. After some 
substitutions this is equivalent to 



Xh 2 (x) -g(x) = 
= 



Xh 2 (x)-g(x)-^-h 2 (x) 
sZ 

Then a solution to this system is given by the GCD of the two equations. Remark 
that proposition EH ensure s there is one unique solution. This formulae are slightly more 
efficient than the ones in (Ler97 . §6.2]. 

For p — 2 we use the library FAAST (for solving Artin-Schreier equations) on top of 
gf 2x (for better performance). There is nothing special to remark about the 2-descent. 



7.2. Cauchy interpolation and loop 

The polynomial interpolation step is done as described in Section [5J As a result of 
this implementation, the polynomial interpolation algorithm was added to the library 
FAAST. 



2 Velu formulae compute this isogeny up to an indeterminacy on the sign of the ordinate, the actual 
value of s must be determined by composing V with <j> and verifying that it corresponds to [p] by trying 
some random points. 
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The rational fraction reconstruction is implemented using a fast XGCD algorithm on 
top of NTL and gf 2x. This algorithm was added to FAAST too. 

The loop uses modular composition as in Section|B]in order to minimise the number of 
interpolations. The timings in the next section clearly show that this non-asymptotically- 
optimal variant performs much faster in practice. 

To check that the rational fractions are isogenics we test their degrees, that their 
denominator is a square and that they act as group morphisms on a fixed number of 
random points. All these checks take a negligible amount of time compared to the rest 
of the algorithm. 

7.3. Parallelisation of the loop 

The most expensive step of C2-AS-FI-MC, in theory as well as in practice, is the final 
loop over the points of E'[p k ]. Fortunately, this phase is very easy to parallelise with 
very few overhead. 

Let n be the number of processors we wish to parallelise on, suppose that [Ufc : ¥ q ] 
is maximal, then we make only one interpolation followed by <fi(p k )/2 modular composi- 

tionsH We set m = ^ n ^ j an d we compute the action of (p m on E[p k ] as in Section |6j 

F [m \X) = F(X) o • • ■ o F(X) mod T(X) , 

this can be don e with Q (logm) modular compositions via a binary square- and- multiply 
approach as in |vzGS9l Algorithm 5.2]. 
Then we compute the n polynomials 

A mi {X) = A m( i_ 1} (X) o F (m \X) mod T(X) 

and distribute them to the n processors so that they each work on a separate slice of 
the Ai's. The only overhead is 9(log(£/n)) modular compositions with coefficients in ¥ q , 
this is acceptable in most cases. 



7-4- Implementation of 



In order to compare our imple menta tion with the state-of-the-art algorithms, we 



implemented a Magma prototype of LS09ll; in what follows, we will refer to this algorithm 



as LS. The algorithm generalises by lifting the curves in the p-adics to avoid 



divisions by zero. Given two curves E and E' and an integer t, it performs the following 
steps 

1. Lift E to E in Q q , 

2. Lift the modular polynomial $^ to in Q q , 

3. Find a root in Q q of jg) that reduces to je 1 in ¥ q , 



4. Apply [BMSS08] in Q q to find an isogeny between E and E' , 



5. Reduce the isogeny to ¥ q 



3 If [Ufc : Wq] is not maximal, the parallelisation is straightforward as we simply send one interpolation 
to each processor in turn. 
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Figure 4: Comparative timings for different implementations of C2-AS-FI-MC with curves defined over 
F 2 ioi • Plot in logarithmic scale. 



We implemented this algorithm using Magma support for the p-adics. Instead of the 
classical modular polynomials $^ we used Atkin's canonical polynomials $^ since they 
have smaller coefficients and degree; this does not change the other steps of the algorithm. 
The modular polynomials were taken from the tables precomputed in Magma. 

The bottleneck of the algorithm is the use of the modular polynomial as its bit size 
is 0(£ 3 ), thus LS is asymptotically worse in £ than C2. However the next section will 
show that LS is more practical than C2 in many circumstances. 



8. Benchmarks 

We ran various experiments to compare the different variants of the algorithm C2 
between themselves and to the other algorithms. All the experiments were run on four 
dual-core Intel Xeon E5430 (2.6GHz), eventually using the parallelised version of the 
algorithm. 

The fi rst set o f experiments was run to evaluate the benefits of using the fast algo- 
rithms in [DFSlOl ]. We selected pairs of isogenous curves over F 2 ioi such that the height 



of the tower is maximal (observe that this is always the case for cryptographic curves). 
The library FAAST offers two types for finite field arithmetics in characteristic 2: zz_p 
which is a generic type for word-precision p and GF2 which uses the optimised algorithms 
of the library gf 2x. We compared implementations of C2-AS-FI-MC using these two 
types with an implementation written in Magma. The results are in figure 21 we plot 
a line for the average running time of the algorithm and bars around it for minimum 
and maximum execution times of the final loop. Besides the dramatic speedup obtained 
by using the ad- hoc type GF2, the algorithmic improvements of FAAST over Magma are 
evident as even zz_p is one order of magnitude faster. 

Tablc[T]shows detailed timings for each phase of C2-AS-FI-MC. The column FI reports 
the time for one interpolation, the column MC the time for one modular composition; 
comparing these two columns the gain from passing from C2-AS-FI to C2-AS-FI-MC is 
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Tabic 1: Comparative timings for the phases of C2-AS-FI-MC for curves over F 2 ioi ■ 
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Figure 5: Comparative timings for C2-AS-FI-MC (C2) and LS over different curves. Plot in logarithmic 
scale. 



evident. Columns RFR (rational fraction reconstruction) and MC constitute the Cauchy 
interpolation step that is repeated in the final loop. The last column reports the average 
time spent in the loop: it is by far the most expensive phase and this justifies the attention 
we paid to FI and MC; only on some huge examples we approached the crosspoint between 
these two algorithms. 

Next, we compare the running times of C2-AS-FI-MC and LS over curves of half the 
cryptographic size in figure [S] (left). We only plot average times for C2, in characteristic 
2 we only plot the timings for GF2. From the plot it is clear th at C2- AS-FI-MC only 
performs better than LS for p = 2, but in this case the algorithm of |Ler96| is by far better. 
Figure [5] (right) shows that LS slowly gets worse than C2, however comparing a Magma 
prototype to our highly optimised implementation of C2-AS-FI-MC is somewhat unfair 
and probably the crosspoint between the two algorithms lies much further. Furthermore, 
it is unlikely that C2-AS-FI-MC could be practical for any p > 3 because of its high 
dependence on p, while LS scales pretty well with the characteristic as shown in figure [5] 

We can hardly hide our disappointment concluding that, despite their good asymp- 
totic behaviour and our hard work implementing them, the variants derived from C2 
don't seem to have any practical application, at least for present data sizes. We hope 
that in the future the algorithms presented here may turn useful to compute very large 
data that arc currently out of reach. 
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